-
Notifications
You must be signed in to change notification settings - Fork 63
Fix sorting bug for named entities #394
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Prafulla Mahindrakar <prafulla.mahindrakar@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #394 +/- ##
==========================================
+ Coverage 60.80% 61.38% +0.57%
==========================================
Files 151 154 +3
Lines 10799 11088 +289
==========================================
+ Hits 6566 6806 +240
- Misses 3529 3578 +49
Partials 704 704
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Signed-off-by: Prafulla Mahindrakar <prafulla.mahindrakar@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the same issue occur in other group by queries like ListWorkflowIds, ListLaunchPlanIds, etc - should we fix those too?
|
||
// Apply consistent sort ordering. | ||
if input.SortParameter != nil { | ||
identifierGroupByWithOrderKey := fmt.Sprintf("%s, %s, %s, %s", Project, Domain, Name, input.SortParameter.GetSortKey()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check that the sort parameter isn't already in (Project, Domain, Name)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have removed this functionality of dynamically adding the column and instead made it part of the identifierGroupBy which takes care of fixing ListWorkflowIds , ListLaunchPlanIds, ListTaskIdentifiers
Signed-off-by: Prafulla Mahindrakar <prafulla.mahindrakar@gmail.com>
Signed-off-by: Prafulla Mahindrakar <prafulla.mahindrakar@gmail.com>
Signed-off-by: Prafulla Mahindrakar <prafulla.mahindrakar@gmail.com>
@@ -30,7 +31,7 @@ const taskTableName = "tasks" | |||
const limit = "limit" | |||
const filters = "filters" | |||
|
|||
var identifierGroupBy = fmt.Sprintf("%s, %s, %s", Project, Domain, Name) | |||
var identifierGroupBy = fmt.Sprintf("%s, %s, %s, %s", Project, Domain, Name, CreatedAt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we always group by created at now? what if someone queries by updated_at or some other field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Facing issue with group by . Let me dig deeper.
I am planning to remove the group by instead do the following queries which should be equivalent
explain SELECT
Distinct
entities.project,
entities.domain,
entities.name,
'2' AS resource_type,
named_entity_metadata.description,
named_entity_metadata.state
FROM "named_entity_metadata"
RIGHT JOIN (
select distinct project,domain,name
from (
SELECT project,domain,name, created_at
FROM "tasks"
WHERE "domain" = 'development' AND "project" = 'admintests' order by created_at
) t limit 2 offset 2) AS entities ON named_entity_metadata.resource_type = 2
AND named_entity_metadata.project = entities.project
AND named_entity_metadata.domain = entities.domain
AND named_entity_metadata.name = entities.name
WHERE COALESCE(named_entity_metadata.state, 0) = '0' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=9.82..9.84 rows=1 width=581)
-> Sort (cost=9.82..9.83 rows=1 width=581)
Sort Key: tasks.project, tasks.domain, tasks.name, named_entity_metadata.description, named_entity_metadata.state
-> Nested Loop Left Join (cost=1.76..9.81 rows=1 width=581)
Join Filter: ((named_entity_metadata.project = tasks.project) AND (named_entity_metadata.domain = tasks.domain) AND (named_entity_metadata.name = tasks.name))
Filter: (COALESCE(named_entity_metadata.state, 0) = 0)
-> Limit (cost=1.61..1.62 rows=1 width=29)
-> HashAggregate (cost=1.59..1.61 rows=2 width=29)
Group Key: tasks.project, tasks.domain, tasks.name
-> Sort (cost=1.41..1.44 rows=9 width=37)
Sort Key: tasks.created_at
-> Seq Scan on tasks (cost=0.00..1.27 rows=9 width=37)
Filter: ((domain = 'development'::text) AND (project = 'admintests'::text))
-> Index Scan using named_entity_metadata_type_project_domain_name_idx on named_entity_metadata (cost=0.14..8.16 rows=1 width=616)
Index Cond: (resource_type = 2)
(15 rows)
Also the query plan look more efficient
One with group by
explain SELECT
entities.project,
entities.domain,
entities.name,
'2' AS resource_type,
named_entity_metadata.description,
named_entity_metadata.state
FROM "named_entity_metadata"
RIGHT JOIN (SELECT project,domain,name FROM "workflows" WHERE "domain" = 'development' AND "project" = 'admintests' GROUP BY project, domain, name, created_at ORDER BY created_at LIMIT 2 offset 2) AS entities ON named_entity_metadata.resource_type = 2
AND named_entity_metadata.project = entities.project
AND named_entity_metadata.domain = entities.domain
AND named_entity_metadata.name = entities.name
WHERE COALESCE(named_entity_metadata.state, 0) = '0'
GROUP BY "created_at",entities.project, entities.domain, entities.name, named_entity_metadata.description, named_entity_metadata.state
ORDER BY created_at desc;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Group (cost=9.70..9.72 rows=1 width=589)
Group Key: named_entity_metadata.created_at, workflows.project, workflows.domain, workflows.name, named_entity_metadata.description, named_entity_metadata.state
-> Sort (cost=9.70..9.71 rows=1 width=557)
Sort Key: named_entity_metadata.created_at DESC, workflows.project, workflows.domain, workflows.name, named_entity_metadata.description, named_entity_metadata.state
-> Nested Loop Left Join (cost=1.58..9.69 rows=1 width=557)
Join Filter: ((named_entity_metadata.project = workflows.project) AND (named_entity_metadata.domain = workflows.domain) AND (named_entity_metadata.name = workflows.name))
Filter: (COALESCE(named_entity_metadata.state, 0) = 0)
-> Limit (cost=1.44..1.46 rows=2 width=37)
-> Group (cost=1.41..1.53 rows=9 width=37)
Group Key: workflows.created_at, workflows.project, workflows.domain, workflows.name
-> Sort (cost=1.41..1.44 rows=9 width=37)
Sort Key: workflows.created_at, workflows.name
-> Seq Scan on workflows (cost=0.00..1.27 rows=9 width=37)
Filter: ((domain = 'development'::text) AND (project = 'admintests'::text))
-> Materialize (cost=0.14..8.17 rows=1 width=624)
-> Index Scan using named_entity_metadata_type_project_domain_name_idx on named_entity_metadata (cost=0.14..8.16 rows=1 width=624)
Index Cond: (resource_type = 2)
We dont need an additional sort for the same key in both inner and outer sql queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a note we still need something equivalent to the WHERE COALESCE(named_entity_metadata.state, 0) = '0'
clause to filter by named entity state
re:
RIGHT JOIN (
select distinct project,domain,name
from (
SELECT project,domain,name, created_at
FROM "tasks"
WHERE "domain" = 'development' AND "project" = 'admintests' order by created_at
) t limit 2 offset 2) AS entities ON named_entity_metadata.resource_type = 2
why do we need to do the nested select? can't we right join directly on the result of SELECT DISTINCT project, domain, name, created_at
Signed-off-by: Prafulla Mahindrakar prafulla.mahindrakar@gmail.com
TL;DR
The column used in ordering should be part of the group by clause aswell and hence have added the created_at column too.
This wasn't happening for created_at column which was absent.
Only the record sets cannot be orderd by state since those dont exist workflow table.
With the fix :
Type
Are all requirements met?
Complete description
How did you fix the bug, make the feature etc. Link to any design docs etc
Tracking Issue
fixes flyteorg/flyte#2327
Follow-up issue
NA